
Conversation

AntonioMacaronio (Contributor) commented Mar 2, 2025

This PR is a follow-up to #3216. Its purpose is to add further documentation, fix bugs, and improve clarity.

Problems and Background

  • One main issue is that on some CUDA versions and builds (such as WSL with CUDA 11.8), GPU tensor serialization is unreliable. When a worker process puts tensors on the GPU (via .to(self.device)), those GPU tensors cannot be properly serialized back to the main process: PyTorch attempts to serialize them but fails silently, resulting in zeroed tensors.
  • depth-nerfacto broken #3592: updated the method_configs.py file to resolve this.
  • AssertionError when using ns-export #3586: ns-export cameras is also resolved by the GPU fixes.

Overview of Changes

  • Worker processes no longer move PyTorch tensors to the GPU. Workers keep tensors on the CPU until they are transferred to the main process, which moves them to the GPU for the forward/backward passes. This applies to both NeRF methods and 3DGS methods (see the sketch after this list).
  • New flowchart guides for training on large datasets!
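
A minimal, self-contained sketch of the new pattern (the toy dataset and names are illustrative, not the actual nerfstudio code): the DataLoader workers return CPU tensors, and only the main process moves them to the GPU.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        # Correct: build and return CPU tensors inside the worker process.
        # Returning CUDA tensors from a worker (e.g. torch.rand(...).to("cuda"))
        # can fail silently (zeroed tensors) or raise CUDA context errors.
        return torch.rand(3, 64, 64)

if __name__ == "__main__":
    loader = DataLoader(ToyDataset(), batch_size=4, num_workers=2, pin_memory=True)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    for batch in loader:
        # The GPU transfer happens here, in the main process.
        batch = batch.to(device, non_blocking=True)
```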

TODOs

  • flowcharts

abrahamezzeddine commented Mar 3, 2025

Hello,

I am getting this bug when using load from disk. It is intermittent and works again after re-launching training.

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\Abraham\.conda\envs\nerfstudio\Scripts\ns-train.exe\__main__.py", line 7, in <module>
  File "O:\Gaussian\nerfstudio\nerfstudio\scripts\train.py", line 272, in entrypoint
    main(
  File "O:\Gaussian\nerfstudio\nerfstudio\scripts\train.py", line 257, in main
    launch(
  File "O:\Gaussian\nerfstudio\nerfstudio\scripts\train.py", line 190, in launch
    main_func(local_rank=0, world_size=world_size, config=config)
  File "O:\Gaussian\nerfstudio\nerfstudio\scripts\train.py", line 101, in train_loop
    trainer.train()
  File "O:\Gaussian\nerfstudio\nerfstudio\engine\trainer.py", line 266, in train
    loss, loss_dict, metrics_dict = self.train_iteration(step)
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "O:\Gaussian\nerfstudio\nerfstudio\utils\profiler.py", line 111, in inner
    out = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "O:\Gaussian\nerfstudio\nerfstudio\engine\trainer.py", line 502, in train_iteration
    _, loss_dict, metrics_dict = self.pipeline.get_train_loss_dict(step=step)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "O:\Gaussian\nerfstudio\nerfstudio\utils\profiler.py", line 111, in inner
    out = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "O:\Gaussian\nerfstudio\nerfstudio\pipelines\base_pipeline.py", line 298, in get_train_loss_dict
    ray_bundle, batch = self.datamanager.next_train(step)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "O:\Gaussian\nerfstudio\nerfstudio\data\datamanagers\full_images_datamanager.py", line 390, in next_train
    camera, data = next(self.iter_train_image_dataloader)[0]
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Abraham\.conda\envs\nerfstudio\Lib\site-packages\torch\utils\data\dataloader.py", line 630, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "C:\Users\Abraham\.conda\envs\nerfstudio\Lib\site-packages\torch\utils\data\dataloader.py", line 1344, in _next_data
    return self._process_data(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Abraham\.conda\envs\nerfstudio\Lib\site-packages\torch\utils\data\dataloader.py", line 1370, in _process_data
    data.reraise()
  File "C:\Users\Abraham\.conda\envs\nerfstudio\Lib\site-packages\torch\_utils.py", line 706, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "C:\Users\Abraham\.conda\envs\nerfstudio\Lib\site-packages\torch\utils\data\_utils\worker.py", line 309, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
           ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Abraham\.conda\envs\nerfstudio\Lib\site-packages\torch\utils\data\_utils\fetch.py", line 33, in fetch
    data.append(next(self.dataset_iter))
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "O:\Gaussian\nerfstudio\nerfstudio\data\utils\dataloaders.py", line 651, in __iter__
    data[k] = data[k].to(self.device)
              ^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

AntonioMacaronio marked this pull request as ready for review April 17, 2025 07:29

akristoffersen (Contributor) left a comment:

LGTM! I don't immediately see what would fix the cam_idx metadata from the cameras to show up here, but if you've validated it on your side, it's good with me!

Here, the variable 'batch' refers to the output of our pixel sampler (a small illustrative sketch follows below).
- batch is a dict with keys ['image', 'indices']
- batch['image'] returns a `torch.Size([4096, 3])` tensor on CPU, where 4096 = num_rays_per_batch.
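
For illustration, here is a tiny self-contained sketch of a batch shaped like the one described above (the (camera_idx, row, col) layout of 'indices' and the toy image stack are assumptions, not the exact sampler code):

```python
import torch

num_rays_per_batch = 4096
images = torch.rand(10, 120, 160, 3)  # toy stack of 10 RGB images, kept on the CPU

# Sample one pixel per ray: each row of `indices` is (camera_idx, row, col).
indices = torch.stack(
    [torch.randint(0, n, (num_rays_per_batch,)) for n in images.shape[:3]], dim=-1
)
batch = {
    "image": images[indices[:, 0], indices[:, 1], indices[:, 2]],  # (4096, 3) on CPU
    "indices": indices,                                            # (4096, 3)
}
assert batch["image"].shape == (num_rays_per_batch, 3)
```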

Contributor:

do we support rgba supervision here?

AntonioMacaronio (Contributor Author):

Yes, it does! When an RGBA image is present in the dataset, it gets converted into RGB format in the InputDataset.

Specifically, this is what happens (a hedged sketch of the compositing step follows the list):

  1. dataloaders.py's RayBatchStream will call self.input_dataset.__getitem__
  2. InputDataset's __getitem__() method calls self.get_data
  3. get_data will call get_image_float32
  4. get_image_float32 has the code for RGBA support
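
As a rough illustration of step 4, here is a minimal, self-contained sketch of RGBA-to-RGB compositing. The function name, default background color, and exact behavior are assumptions for illustration, not the actual get_image_float32 implementation.

```python
import torch

def rgba_to_rgb(image: torch.Tensor, background: float = 1.0) -> torch.Tensor:
    """Composite an (H, W, 4) float image in [0, 1] over a solid background color."""
    if image.shape[-1] == 4:
        rgb, alpha = image[..., :3], image[..., 3:4]
        # Alpha-blend the color channels over the background so downstream code
        # only ever sees 3-channel images.
        return rgb * alpha + background * (1.0 - alpha)
    return image  # already RGB, nothing to do
```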

AntonioMacaronio (Contributor Author) commented Apr 20, 2025

@akristoffersen It's a little obscure how the camera metadata gets fixed, but when a worker process sends a PyTorch tensor or a tensor dataclass object (like Cameras) to the main process, that tensor or tensor dataclass object has to be on the CPU device. If it is on the GPU, you get CUDA context errors and/or other unpredictable behavior like what @abrahamezzeddine ran into. A minimal sketch of the rule follows the links below.

Here are some links talking about this:
https://discuss.pytorch.org/t/cuda-initialization-error-when-dataloader-with-cuda-tensor/43390/9
https://discuss.pytorch.org/t/dataloader-multiprocessing-with-dataset-returning-a-cuda-tensor/151022
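
To make the rule concrete, here is a small, self-contained sketch. ToyCameras is a hypothetical stand-in for a tensor dataclass like Cameras (not nerfstudio's actual class): everything that crosses the worker-to-main-process boundary stays on the CPU, and only the main process moves it to the GPU.

```python
import dataclasses
import torch

@dataclasses.dataclass
class ToyCameras:
    """Hypothetical stand-in for a tensor dataclass such as Cameras."""
    camera_to_worlds: torch.Tensor  # (N, 3, 4) camera poses
    fx: torch.Tensor                # (N, 1) focal lengths

    def to(self, device) -> "ToyCameras":
        return ToyCameras(self.camera_to_worlds.to(device), self.fx.to(device))

# Worker side: everything is built on (and stays on) the CPU.
cameras = ToyCameras(
    camera_to_worlds=torch.eye(4)[:3].repeat(2, 1, 1),
    fx=torch.full((2, 1), 500.0),
)

# Main-process side: move to the GPU only after the object has crossed the
# process boundary, right before the forward/backward pass needs it.
device = "cuda" if torch.cuda.is_available() else "cpu"
cameras = cameras.to(device)
```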

AntonioMacaronio merged commit 94357f8 into nerfstudio-project:main Apr 20, 2025
3 checks passed